www.fujitsu.com



## Fujitsu's challenge for Petascale Computing

October 9, 2008

Motoi Okuda

Technical Computing Solutions Group Fujitsu Limited

ORAP Forum, 9th Oct. 2008




## Agenda

- Fujitsu's Approach for Petascale Computing and HPC Solution Offerings
- Japanese Next Generation Supercomputer Project and Fujitsu's Contributions
- Fujitsu's Challenges for Petascale Computing
- Conclusion



## Fujitsu's approach for Scaling up to 10 Pflops





## **Key Issues for Approaching Petascale Computing**

- How to utilize multi-core CPUs?
- How to handle a hundred thousand processes?
- How to realize high reliability, availability and data integrity in a system of a hundred thousand nodes?
- How to decrease electric power and footprint?

- Fujitsu's stepwise approach to product release ensures that customers can be prepared for Petascale computing
- Step 1: 2008 ~
  - The new high-end technical computing server FX1
    - New Integrated Multi-core Parallel ArChiTecture (IMPACT)
    - Intelligent interconnect
    - Extremely reliable CPU design
  - Provides a highly efficient hybrid parallel programming environment
  - Design of a Petascale system that inherits the FX1 architecture
- Step 2: 2011 ~
  - Petascale system with a new high-performance, highly reliable, low-power CPU, an innovative interconnect and high-density packaging


## **Current Technical Computing Platforms**

|           | Large-scale SMP System Solutions | High-end TC Solutions | Cluster Solutions | Solidware Solutions |
|-----------|----------------------------------|-----------------------|-------------------|---------------------|
| Strengths | Up to 2TB memory space for TC applications<br>High I/O bandwidth for I/O server<br>High reliability based on main-frame technology<br>High-end RISC CPU | Scalability up to 100Tflops class<br>Highly effective performance<br>High-end RISC CPU | Optimal price/performance for MPI-based applications<br>Highly scalable InfiniBand interconnect | Ultra high performance for specific applications |
| Products  | SPARC Enterprise M9000 (SPARC64™ VII, 64cpu, SPARC/Solaris)<br>PRIMEQUEST 580 (Itanium® 2, 32cpu, IA/Linux) | FX1 (SPARC64™ VII, SPARC/Solaris), new | PRIMERGY BX Series, RX Series and HX600 (IA/Linux) | RG1000 (FPGA board) |


All Rights Reserved, Copyright FUJITSU LIMITED 2008


## **Customers of large scale TC systems**

#### Fujitsu has installed over 1200 TC systems for over 400 customers.

| Customer | Type | No. of CPUs | Performance |
|----------|------|-------------|-------------|
| Japan Aerospace Exploration Agency (JAXA)<br>*This system will be installed at the end of 2008 | Cluster<br>Scalar SMP | >3,500 | 135 TFlops |
| Manufacturer A | Scalar SMP<br>Cluster | >3,500 | >80 TFlops |
| KYOTO University Computing Center | Cluster<br>Scalar SMP | >2,000 | >61.2 TFlops |
| KYUSYU University Computing Center | Scalar SMP<br>Cluster | 1,824 | 32 TFlops |
| Manufacturer B | Cluster | >1,200 | >15 TFlops |
| RIKEN | Cluster | 3,088 | 26.18 TFlops |
| NAGOYA University Computing Center | Scalar SMP | 1,600 | 13 TFlops |
| TOKYO University KAMIOKA Observatory | Cluster | 540 | 12.9 TFlops |
| National Institute of Genetics | Cluster<br>Scalar SMP | 324 | 6.9 TFlops |
| Institute for Molecular Science | Scalar SMP | 320 | 4 TFlops |

## Latest case study

- Kyoto University operates one of the largest computing centers in Japan.






## **FX1 Launch customer**

#### First system will be installed at JAXA by the end of 2008





## FX1 : New High-end TC Server - Outline -

#### • High-performance CPU designed by Fujitsu

- SPARC64™ VII: 4 cores, 65 nm technology
- Performance: 40 Gflops (2.5 GHz)
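The 40 Gflops figure can be cross-checked with simple arithmetic. The sketch below derives the number of floating-point operations each core would have to complete per clock cycle, using only the values on the slide; the function name is illustrative and does not come from the source.

```python
def flops_per_cycle_per_core(peak_gflops, cores, clock_ghz):
    """Floating-point operations each core must complete per clock cycle."""
    return peak_gflops / (cores * clock_ghz)


# Slide values: 40 Gflops per chip, 4 cores, 2.5 GHz.
implied = flops_per_cycle_per_core(40.0, 4, 2.5)
print(f"{implied:.0f} flops per cycle per core")
```

A value of 4 would be consistent with two fused multiply-add operations per cycle per core, although the slide does not spell out that breakdown.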



- New architecture for high-end TC server
  - Integrated Multi-core Parallel ArChiTecture by leading edge CPU and compiler technologies
  - Blade type node configuration for high memory bandwidth

#### High-speed intelligent interconnect

- Combination of InfiniBand DDR interconnect and the highly-functional switch
- The highly-functional switch implements barrier synchronization and high-speed reduction between nodes in hardware

## • Petascale system inherits Integrated Multi-core Parallel ArChiTecture

Suitable platform to develop and evaluate Petascale applications



# Integrated Multi-core Parallel ArChiTecture Introduction

#### Concept

Highly efficient thread level parallel processing technology for multi-core chip



#### • Advantage

Handles the multi-core CPU as one equivalent faster CPU

- Reduces number of MPI processes to 1/n<sub>core</sub> and increases parallel efficiency
- → Reduces memory-wall problem
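The reduction in MPI process count can be illustrated with a small calculation. This is a minimal sketch, assuming one MPI process per core in the pure-MPI case and one MPI process per CPU (with one thread per core) in the hybrid case; the CPU count of 10,000 is a hypothetical example, not a figure from the slides.

```python
def mpi_process_counts(cpus, cores_per_cpu):
    """Compare MPI process counts for pure MPI and hybrid (process + threads)."""
    pure_mpi = cpus * cores_per_cpu   # one process per core
    hybrid = cpus                     # one process per CPU, threads inside it
    return {
        "pure_mpi": pure_mpi,
        "hybrid": hybrid,
        "ratio": hybrid / pure_mpi,   # equals 1 / cores_per_cpu
    }


# Hypothetical system: 10,000 quad-core CPUs.
counts = mpi_process_counts(10_000, 4)
print(counts)
```

With four cores per CPU the hybrid layout needs a quarter of the processes, which matches the 1/n<sub>core</sub> factor stated above.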

## Challenge

How to decrease the thread level parallelization overhead?



# Integrated Multi-core Parallel ArChiTecture Key technologies

## CPU Technologies

Hardware barrier synchronization between cores

- Reduces overhead for parallel execution, 10 times faster than software emulation
- → Start up time is comparable to that of the vector unit
- Barrier overhead remains constant regardless of the number of cores
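For readers unfamiliar with barrier synchronization, the following sketch shows what a barrier guarantees, using a software barrier from the Python standard library. It only illustrates the semantics; it says nothing about how the SPARC64 VII hardware barrier is implemented, and the worker function and data sizes are made up for the example.

```python
import threading


def run_two_phases(n_threads=4):
    """Each thread writes a partial sum, waits at a barrier, then reads all of them."""
    partial = [0] * n_threads
    totals = [0] * n_threads
    barrier = threading.Barrier(n_threads)

    def worker(tid):
        # Phase 1: each thread computes its own slice.
        partial[tid] = sum(range(tid * 100, (tid + 1) * 100))
        # No thread passes this point until every thread has finished phase 1.
        barrier.wait()
        # Phase 2: safe to read the results written by the other threads.
        totals[tid] = sum(partial)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals


print(run_two_phases(4))
```

Every fine-grain parallel loop ends in a synchronization point of this kind, which is why the cost of the barrier matters so much for thread-level parallelization.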





## SPARC64™ VII

Real quad-core CPU for Technical Computing (2.5GHz, 40Gflops/chip)

- Shared L2 cache memory (6 MB)
  - → Reduces the number of cache-to-cache data transfers
  - Efficient cache memory usage

## Compiler technologies

Automatic parallelization or OpenMP for thread-based algorithms, building on vectorization technology



#### Integrated Multi-core Parallel ArChiTecture Outline of parallelization methods

|                          | Vectorization on vector machine | Legacy parallelization on scalar machine | Fine-grain parallelization on scalar machine |
|--------------------------|---------------------------------|------------------------------------------|----------------------------------------------|
| Loop nest                | `DO J=1,N`<br>`DO I=1,M`<br>`A(I,J)=A(I,J+1)*B(I,J)`<br>`END`<br>`END` | `DO J=1,N`<br>`DO I=1,M`<br>`A(I,J)=A(I,J)*B(I,J)`<br>`END`<br>`END` | `DO J=1,N`<br>`DO I=1,M`<br>`A(I,J)=A(I,J+1)*B(I,J)`<br>`END`<br>`END` |
| Outer loop (J, length N) | Serial | Parallel | Serial |
| Inner loop (I, length M) | Parallel | Serial | Parallel |
| Applicability            | Wide | Narrow (requires wide-range analysis) | Wide |
| Synchronization          | Frequent but low cost | Occasional | Frequent (high overhead) |

Integrated Multi-core Parallel ArChiTecture addresses the weak point of fine-grain parallelization on scalar machines: its frequent synchronization overhead.
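The loop nest `A(I,J)=A(I,J+1)*B(I,J)` from this slide shows why the choice of parallel loop matters. The sketch below is a Python illustration under stated assumptions: the array contents, helper names and thread pool are invented for the example, and Python threads stand in for cores without claiming any speedup. It checks that running the inner `I` loop in parallel while keeping `J` serial reproduces the serial result, whereas changing the order of the `J` iterations does not.

```python
from concurrent.futures import ThreadPoolExecutor


def make_inputs(m, n):
    # A needs n + 1 columns because column j reads column j + 1.
    a = [[float(i + j + 1) for j in range(n + 1)] for i in range(m)]
    b = [[float((i * j) % 3 + 2) for j in range(n)] for i in range(m)]
    return a, b


def serial(a, b):
    """Reference: both loops run in their original order."""
    a = [row[:] for row in a]
    for j in range(len(b[0])):
        for i in range(len(a)):
            a[i][j] = a[i][j + 1] * b[i][j]
    return a


def fine_grain(a, b, workers=4):
    """Inner loop over i runs in parallel; outer loop over j stays serial."""
    a = [row[:] for row in a]
    m = len(a)

    def update(i, j):
        a[i][j] = a[i][j + 1] * b[i][j]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for j in range(len(b[0])):
            # list() waits for all tasks: one synchronization per j iteration.
            list(pool.map(update, range(m), [j] * m))
    return a


def outer_reordered(a, b):
    """Outer loop executed in reverse, as an unsynchronized parallel run might."""
    a = [row[:] for row in a]
    for j in reversed(range(len(b[0]))):
        for i in range(len(a)):
            a[i][j] = a[i][j + 1] * b[i][j]
    return a


a0, b0 = make_inputs(3, 4)
print(fine_grain(a0, b0) == serial(a0, b0))
print(outer_reordered(a0, b0) == serial(a0, b0))
```

The fine-grain version pays for one synchronization per outer iteration, which is the overhead that a fast hardware barrier is meant to reduce.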



#### Integrated Multi-core Parallel ArChiTecture, preliminary measured data Performance measurement by automatic parallelization

• LINPACK performance on 1 CPU(4 cores)

- n = 100 → 3.26 Gflops
- n = 40,000 → 37.8 Gflops (93.8%)
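The efficiency in parentheses is measured performance divided by peak performance. The sketch below redoes that arithmetic against the nominal 40 Gflops peak quoted elsewhere in this deck; the function names are illustrative.

```python
def efficiency_percent(measured_gflops, peak_gflops):
    """LINPACK efficiency: measured performance as a percentage of peak."""
    return 100.0 * measured_gflops / peak_gflops


def implied_peak_gflops(measured_gflops, efficiency_pct):
    """Peak performance that a quoted efficiency would imply."""
    return measured_gflops / (efficiency_pct / 100.0)


small = efficiency_percent(3.26, 40.0)     # n = 100
large = efficiency_percent(37.8, 40.0)     # n = 40,000
implied = implied_peak_gflops(37.8, 93.8)  # peak implied by the slide's 93.8%

print(f"n=100:    {small:.2f}% of nominal peak")
print(f"n=40,000: {large:.1f}% of nominal peak")
print(f"93.8% implies a peak of about {implied:.1f} Gflops")
```

Against a 40 Gflops peak the large problem comes out near 94.5%, so the slide's 93.8% appears to have been computed against a slightly higher peak of roughly 40.3 Gflops. The source does not state the reason for the difference.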

### • Performance comparison of DAXPY (EuroBen Kernel 8) on 1 CPU

- 4 cores + IMPACT show better performance than
  - → a single core, even with a small number of loop iterations
  - → x86 servers
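As a reminder of what the kernel computes, DAXPY is the BLAS operation y ← αx + y on double-precision vectors. A minimal plain-Python sketch follows; it is not the EuroBen benchmark code.

```python
def daxpy(alpha, x, y):
    """Return alpha * x + y element by element (2 flops per element)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [alpha * xi + yi for xi, yi in zip(x, y)]


result = daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)
```

Because each element needs only two floating-point operations, short vectors give the threads very little work per synchronization, which is why the small-iteration case is a demanding test of thread start-up overhead.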



#### Integrated Multi-core Parallel ArChiTecture, preliminary measured data Performance measurement of NPB on 1 CPU

- Performance comparison of NPB class C between pure MPI and Integrated Multi-core Parallel ArChiTecture on 1 CPU (4 cores)
  - IMPACT (OMP) is better than pure MPI for 6 of 7 programs




## FX1 Intelligent Interconnect Introduction

- Combination of Fat tree topology InfiniBand DDR interconnect and the highly-functional switch (Intelligent switch)
- Intelligent switch
  - Result of the PSI (Petascale System Interconnect) national project

Functions

- Hardware barrier function among nodes
- Hardware assistance for MPI functions (synchronization and reduction)
- Global ping for OS scheduling

Advantages

- Faster HW Barrier speeds up OpenMP and data parallel FORTRAN (XPF)
- Fast collective operations accelerate highly parallel applications
- Reduces OS jitter effect



#### Intelligent Switch & its connection



#### FX1 Intelligent Interconnect High performance barrier & reduction hardware

- Hardware barrier and reduction show low latency and constant overhead in comparison with software barrier and reduction\*.

\*: Executed by the host processors using a butterfly network built from point-to-point communication.
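The software baseline in the footnote can be sketched as a butterfly (recursive-doubling) all-reduce. The simulation below is a single-process Python model, not MPI code: list entries stand in for ranks, and the list comprehension at each stage stands in for the pairwise point-to-point exchanges. It assumes a power-of-two number of ranks.

```python
def butterfly_allreduce(values, op=lambda a, b: a + b):
    """Simulate a butterfly all-reduce; returns (per-rank results, stage count)."""
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("number of ranks must be a power of two")
    current = list(values)
    stages = 0
    distance = 1
    while distance < p:
        # Each rank r exchanges with rank r XOR distance and combines the values.
        current = [op(current[r], current[r ^ distance]) for r in range(p)]
        distance *= 2
        stages += 1
    return current, stages


result, stages = butterfly_allreduce(list(range(8)))
print(result, stages)
```

The number of stages grows as log2 of the rank count (7 stages for 128 ranks in this model), so software latency rises with system size, in contrast with the constant overhead reported for the hardware function.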



#### FX1 Intelligent Interconnect Stability of reduction function

The intelligent interconnect achieves stable reduction performance through its global ping function.

Reduction (Allreduce) performance on a 128-node system



## **Technical Computing server roadmap**

#### • Development of the commodity based server and of the proprietary High End server for Technical Computing.





## Agenda

- Fujitsu's Approach for Petascale Computing and HPC Solution Offerings
- Japanese Next Generation Supercomputer Project and Fujitsu's Contributions
- Fujitsu's Challenges for Petascale Computing
- Conclusion



#### Japanese Next Generation Supercomputer Project\* Project Target

Source: RIKEN official report

~\$1.2 B

\*: Sponsored by MEXT (Ministry of Education, Culture, Sports, Science and Technology)

**RIKEN Next-Generation Supercomputer R&D Center** 

#### Development & Application of Next-Generation Supercomputer Project by MEXT

- FY2006: 3,547 million yen
- FY2007: 7,736 million yen
- FY2006 ~ FY2012 (total budget expected): about 110 billion yen

1. Purpose of policy

To develop and implement the world's most advanced, high-performance Next-Generation Supercomputer, and to develop and disseminate the technologies for using it, as one of Japan's "Key Technologies of National Importance" (National Infrastructure).

In order to maintain a world-leading position in a variety of areas, the following academic-industrial collaboration activities will be conducted under the initiative of MEXT.

- (1) Development and implementation of the world's most advanced high-performance Next-Generation supercomputer
- (2) Development and dissemination of software that makes optimum use of the supercomputer
- (3) Establishment of the world's most advanced and highest standard supercomputing Center of Excellence, which includes the Next-Generation Supercomputer

#### 3. Project Framework

- Integrated development of computer and software
- Establishment of nationwide academic-industrial collaborative structure, with RIKEN as the project headquarters
- A new law has been introduced for the framework of usage and administration





#### Japanese Next Generation Supercomputer Project **Project Outline**

System configuration

The hardware system consists of scalar and vector processor units.

- The target performance
  - 10 PFlops on the LINPACK benchmark (BMT)
- Contributor
  - Fujitsu, Hitachi and NEC join the project as the system developers.
- Schedule
  - A prototype system will be available for operation from the end of FY2010 and the full system will be available from the end of FY2011.



Source: CSTP evaluation working group report

#### Japanese Next Generation Supercomputer Project Major Applications of Next Generation Supercomputer




#### Japanese Next Generation Supercomputer Project Basic Concept for Simulations in Nano-Science

#### Led by IMS (Institute for Molecular Science)




Through the courtesy of RIKEN






## Agenda

- Fujitsu's Approach for Petascale Computing and HPC Solution Offerings
- Japanese Next Generation Supercomputer Project and Fujitsu's Contributions
- Fujitsu's Challenges for Petascale Computing
- Conclusion



## Fujitsu's approach for Scaling up to 10 Pflops






## **Fujitsu's Challenges for Petascale Supercomputer**











## Interconnect for parallel computer system

#### Interconnect type and its characteristic

| Interconnect type                                | Crossbar                   | Fat-Tree                                  | Mesh / Torus                 |
|--------------------------------------------------|----------------------------|-------------------------------------------|------------------------------|
| Performance                                      | ◎ (Best)                   | ○ (Good)                                  | △ (Average)                  |
| Operability and usability                        | ◎ (Best)                   | ○ (Good)                                  | × (Weak)                     |
| Cost, packaging density<br>and power consumption | × (Weak)                   | △ (Average)                               | ○ (Good)                     |
| Scalability                                      | Hundreds of nodes<br>× (Weak) | Thousands of nodes<br>△ - ○ (Average to Good) | >10,000 nodes<br>◎ (Best)    |
| Representative                                   | Vector parallel            | PC cluster                                | Scalar massively<br>parallel |

#### • Targeting over 10,000 nodes parallel system

- Cost, packaging density and power consumption are essential issues
- A mesh interconnect needs too many hops
  - → A torus interconnect is a strong candidate
  - The greatest challenge of a torus interconnect is operability and usability
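The hop-count argument can be made concrete with the worst-case distance (diameter) of each topology. In a mesh a message may have to cross a whole dimension, while the wraparound links of a torus cap the distance at half of it. The sketch below applies the standard formulas; the 22 × 22 × 22 layout is a hypothetical example of a system with just over 10,000 nodes.

```python
def mesh_diameter(dims):
    """Worst-case hop count in a mesh: end to end in every dimension."""
    return sum(k - 1 for k in dims)


def torus_diameter(dims):
    """Worst-case hop count in a torus: at most halfway around each ring."""
    return sum(k // 2 for k in dims)


dims = (22, 22, 22)  # hypothetical 10,648-node system
nodes = dims[0] * dims[1] * dims[2]
print(nodes, mesh_diameter(dims), torus_diameter(dims))
```

For this layout the torus roughly halves the worst-case hop count (33 instead of 63) without adding switches, at the price of the wraparound cabling.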

### • Fujitsu takes on the challenge of developing an innovative Torus interconnect



## Fujitsu's Interconnect for Petascale computer system

## • Architecture

- Improved 3D Torus
- Switchless

## Advantages

- Low latency and low power consumption
- Scalability over 100,000 nodes
- High reliability and availability
- High-density packaging
- Reduced wiring cost
- Simple 3D torus logical (application) view
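From the application's point of view, a simple 3D torus means that every node has six logical neighbors, with coordinates wrapping around at the edges. The following sketch shows that view for a plain 3D torus; it is an illustration only and does not describe the internals of Fujitsu's improved design, which the slides do not detail.

```python
def torus_neighbors(coord, dims):
    """Return the six logical neighbors of a node in a 3D torus."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            # The modulo provides the wraparound link at each edge.
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


corner = torus_neighbors((0, 0, 0), (4, 4, 4))
print(corner)
```

Even a corner node has a full set of six neighbors, so nearest-neighbor communication patterns look the same everywhere in the machine.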



### Improved 3D torus Architecture



## Agenda

- Fujitsu's Approach for Petascale Computing and HPC Solution Offerings
- Japanese Next Generation Supercomputer Project and Fujitsu's Contributions
- Fujitsu's Challenges for Petascale Computing
- Conclusion



## Conclusion

- Fujitsu continues to invest in HPC technology to provide solutions to meet the broadest user requirements at the highest levels of performance
- Targeting sustained Pflops performance, Fujitsu has embarked on the Petascale Computing challenge


